BoΔaziΓ§i University ΒΆ
IE582 Homework I ΒΆ
Brief Summary about the homeworkΒΆ
In this homework, Borsa Istanbul stock prices dataset was analyzed. The dataset includes randomly selected stocks with close prices over given period. Thanks to the required steps in homework, detailed data analysis that is from descriptive analysis to dimensionality reduction via PCA method was occurred.
# Required libraries imported
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use("ggplot")
import seaborn as sns
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA as PCA_Function
from scipy.stats import kurtosis, skew, trim_mean
# Data read, if you would like to run the script in your local environment, please adjust DATA_LOCAL_PATH for your environment.
# Along the homework, only long format data was used to do proper data analysis.
DATA_LOCAL_PATH = "/home/ssc/Desktop/BOUN/Courses/Fall-23/IE 582/fall-23-ssehacirit/HW1/"
data = pd.read_csv(DATA_LOCAL_PATH + "all_ticks_long.csv.gz")
# timestamp manipulation to transform given timestamp format into pandas datetime format which is more useful than another.
data["timestamp"] = pd.to_datetime(data.timestamp).dt.tz_convert(None)
data
| short_name | timestamp | price | |
|---|---|---|---|
| 0 | AEFES | 2012-09-17 06:45:00 | 22.3978 |
| 1 | AEFES | 2012-09-17 07:00:00 | 22.3978 |
| 2 | AEFES | 2012-09-17 07:15:00 | 22.3978 |
| 3 | AEFES | 2012-09-17 07:30:00 | 22.3978 |
| 4 | AEFES | 2012-09-17 07:45:00 | 22.5649 |
| ... | ... | ... | ... |
| 2848025 | ZOREN | 2019-07-22 14:00:00 | 1.1800 |
| 2848026 | ZOREN | 2019-07-22 14:15:00 | 1.1800 |
| 2848027 | ZOREN | 2019-07-22 14:30:00 | 1.1800 |
| 2848028 | ZOREN | 2019-07-22 14:45:00 | 1.1800 |
| 2848029 | ZOREN | 2019-07-22 15:00:00 | 1.1700 |
2848030 rows Γ 3 columns
Task I - Descriptive AnalysisΒΆ
The dataset was analyzed in terms of some summary stats features like min, max, median, quartiles, std etc. Main idea in this step is feeling the structure of the dataset, understanding distribution of given variables and calculating measures of central tendency, dispersion etc.
# Because the dataset is long format, with the groupby function of pandas, calculations were easily made.
# Required information were provided below.
data.groupby("short_name").apply(lambda x: pd.Series({
"N": x["price"].count(),
"minDate": x["timestamp"].min(),
"maxDate": x["timestamp"].max(),
"min_price": x["price"].min(),
"q25_price": x["price"].quantile(0.25),
"q50_price": x["price"].quantile(0.5), # also known as 'median',
"q75_price": x["price"].quantile(0.75),
"max_price": x["price"].max(),
"mean_price": x["price"].mean(),
"std_price": x["price"].std(),
"skew_price": skew(x["price"], nan_policy="omit"),
"kurtosis_price": kurtosis(x["price"], nan_policy="omit"),
"trim-mean_price": trim_mean(x["price"], proportiontocut=0.05),
"mad_price": abs(x["price"] - x["price"].mean()).mean(),
"range_price": x["price"].max() - x["price"].min()
}))
| N | minDate | maxDate | min_price | q25_price | q50_price | q75_price | max_price | mean_price | std_price | skew_price | kurtosis_price | trim-mean_price | mad_price | range_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| short_name | |||||||||||||||
| AEFES | 48131 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 19.16050 | 20.6465 | 22.73200 | 28.5090 | 20.982235 | 2.494002 | 0.306407 | -0.445736 | 20.940443 | 2.041743 | 28.5089 |
| AKBNK | 49209 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 5.85000 | 6.3057 | 6.93250 | 9.2124 | 6.473105 | 0.944955 | 0.579306 | -0.039912 | 6.446953 | 0.732379 | 9.2123 |
| AKSA | 48594 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 5.20880 | 6.9853 | 8.72000 | 15.1189 | 7.127504 | 2.710033 | 0.169080 | -0.101457 | 7.100438 | 2.123153 | 15.1188 |
| AKSEN | 48171 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0000 | 2.67000 | 2.9300 | 3.75000 | 5.1900 | 3.183542 | 0.724332 | 0.825970 | -0.353334 | 3.144727 | 0.603710 | 5.1900 |
| ALARK | 48335 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 1.56890 | 1.9376 | 2.42140 | 3.5143 | 2.060859 | 0.575943 | 0.645715 | -0.533558 | 2.035536 | 0.473908 | 3.5142 |
| ALBRK | 46862 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 1.0255 | 1.22510 | 1.3602 | 1.50000 | 2.1900 | 1.365549 | 0.167824 | 0.391135 | 0.186814 | 1.361010 | 0.140982 | 1.1645 |
| ANACM | 48165 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 1.04700 | 1.2597 | 2.40210 | 3.5021 | 1.672102 | 0.788365 | 0.712827 | -1.049554 | 1.633785 | 0.702866 | 3.5020 |
| ARCLK | 49045 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 11.71110 | 15.0100 | 19.08770 | 26.4278 | 15.388088 | 4.531459 | 0.115708 | -0.767086 | 15.350387 | 3.893553 | 26.4277 |
| ASELS | 48803 | 2012-09-17 06:45:00 | 2019-07-23 07:00:00 | 0.0001 | 4.94030 | 9.2757 | 22.75670 | 46.7616 | 13.432535 | 9.624246 | 0.697246 | -0.917501 | 12.894643 | 8.609648 | 46.7615 |
| ASUZU | 48433 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 5.07480 | 5.9496 | 7.12000 | 15.2800 | 6.467033 | 2.201036 | 1.482039 | 2.347907 | 6.255543 | 1.596651 | 15.2799 |
| AYGAZ | 48119 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 5.95150 | 7.7238 | 10.26900 | 13.5935 | 8.101948 | 2.610402 | 0.295531 | -1.063786 | 8.081058 | 2.252556 | 13.5934 |
| BAGFS | 48650 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 8.26175 | 10.6100 | 12.35000 | 38.4352 | 10.407127 | 3.618058 | 2.966101 | 21.320628 | 10.262403 | 2.413341 | 38.4351 |
| BANVT | 47951 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0000 | 2.59000 | 3.7100 | 11.93000 | 28.6800 | 7.628230 | 6.267278 | 1.012752 | -0.004878 | 7.079116 | 5.502986 | 28.6800 |
| BRISA | 48937 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 5.89000 | 6.7300 | 7.33000 | 10.3275 | 6.544896 | 1.295321 | -0.169787 | -0.172206 | 6.542497 | 1.025382 | 10.3274 |
| CCOLA | 48749 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 31.97820 | 34.8215 | 42.04970 | 54.2208 | 36.890707 | 6.747213 | 0.609038 | -0.673140 | 36.662293 | 5.664458 | 54.2207 |
| CEMAS | 46394 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0000 | 0.70000 | 0.8700 | 1.50000 | 7.0100 | 1.209088 | 0.799981 | 2.354746 | 7.839219 | 1.107462 | 0.579235 | 7.0100 |
| ECILC | 48492 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 1.17230 | 1.8214 | 2.78090 | 4.2278 | 2.075865 | 0.973788 | 0.496998 | -1.063596 | 2.041292 | 0.856328 | 4.2277 |
| EREGL | 49173 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 2.18120 | 3.0360 | 6.75870 | 10.4710 | 4.179544 | 2.690731 | 0.572196 | -1.034466 | 4.073754 | 2.360051 | 10.4709 |
| FROTO | 48995 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 21.49380 | 27.1182 | 48.51160 | 65.4192 | 32.763693 | 14.732664 | 0.571357 | -1.040751 | 32.348909 | 12.778048 | 65.4191 |
| GARAN | 49308 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 7.01540 | 7.6542 | 8.67860 | 12.1554 | 7.899734 | 1.249637 | 0.663114 | 0.088658 | 7.852954 | 0.994845 | 12.1553 |
| GOODY | 48961 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 2.42770 | 3.1920 | 3.59660 | 58.7574 | 3.102485 | 0.886456 | 10.197231 | 623.321452 | 3.074397 | 0.653305 | 58.7573 |
| GUBRF | 49057 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 3.27650 | 4.2500 | 5.13000 | 13.6191 | 4.328323 | 1.222988 | 0.501696 | -0.438447 | 4.275643 | 1.026014 | 13.6190 |
| HALKB | 49071 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 8.72050 | 10.6531 | 13.49090 | 20.2365 | 10.919353 | 3.071563 | 0.205094 | -0.601357 | 10.865138 | 2.574486 | 20.2364 |
| ICBCT | 44336 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0000 | 1.55960 | 2.0300 | 4.07000 | 11.2700 | 2.828502 | 1.789883 | 1.494058 | 2.611958 | 2.665473 | 1.444402 | 11.2700 |
| ISCTR | 49221 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 4.32000 | 4.8543 | 5.82030 | 7.9639 | 5.126551 | 1.003386 | 0.700135 | -0.574340 | 5.079921 | 0.839786 | 7.9638 |
| ISDMR | 12227 | 2016-03-28 06:30:00 | 2019-07-23 15:00:00 | 1.0181 | 4.85420 | 5.9063 | 6.56070 | 7.5936 | 5.351663 | 1.697918 | -1.227264 | 0.325652 | 5.473200 | 1.318965 | 6.5755 |
| ISFIN | 42877 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 0.56390 | 0.8635 | 1.67420 | 9.8300 | 1.559420 | 1.764839 | 2.167203 | 4.169988 | 1.310759 | 1.253275 | 9.8299 |
| ISYAT | 43184 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 0.44120 | 0.4957 | 0.63330 | 1.1500 | 0.537338 | 0.160246 | 0.529581 | 1.063215 | 0.533277 | 0.123556 | 1.1499 |
| KAREL | 46032 | 2012-09-17 06:45:00 | 2019-07-22 15:00:00 | 0.0001 | 1.53130 | 1.8200 | 5.25000 | 9.4600 | 3.178023 | 2.133619 | 0.859170 | -0.689921 | 3.019454 | 1.916274 | 9.4599 |
| KARSN | 48527 | 2012-09-17 06:45:00 | 2019-07-22 15:00:00 | 0.0001 | 1.11000 | 1.2874 | 1.47000 | 2.5000 | 1.326907 | 0.290413 | 1.031105 | 1.282541 | 1.306489 | 0.224876 | 2.4999 |
| KCHOL | 49093 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 9.73680 | 12.0449 | 15.16930 | 19.1500 | 12.248291 | 3.181444 | 0.039270 | -1.090014 | 12.245111 | 2.715647 | 19.1499 |
| KRDMB | 47532 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 1.56120 | 2.2007 | 2.72730 | 4.4960 | 2.222798 | 0.686385 | 0.494980 | -0.497376 | 2.190607 | 0.577845 | 4.4959 |
| KRDMD | 49161 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 1.08450 | 1.3979 | 2.16900 | 4.9510 | 1.768390 | 0.940092 | 1.368964 | 1.130739 | 1.683724 | 0.732107 | 4.9509 |
| MGROS | 48903 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 16.66000 | 19.1100 | 22.10000 | 30.2600 | 19.576424 | 3.901269 | 0.466080 | -0.338842 | 19.449764 | 3.171047 | 30.2599 |
| OTKAR | 48785 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 56.77570 | 82.8224 | 105.49880 | 139.4288 | 81.419528 | 27.782825 | -0.233590 | -0.890270 | 81.979483 | 23.389661 | 139.4287 |
| PARSN | 45325 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0000 | 4.57000 | 7.8900 | 10.65000 | 29.8200 | 8.276989 | 4.662471 | 1.119821 | 1.930259 | 7.923042 | 3.613881 | 29.8200 |
| PETKM | 49184 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 1.28690 | 2.2845 | 3.88280 | 5.7697 | 2.539237 | 1.378510 | 0.472626 | -1.122838 | 2.477695 | 1.221582 | 5.7696 |
| PGSUS | 45221 | 2013-04-26 06:30:00 | 2019-07-23 15:00:00 | 0.0000 | 17.79000 | 25.6400 | 29.44000 | 50.6500 | 24.789487 | 7.656535 | 0.215107 | -0.321921 | 24.576927 | 6.233082 | 50.6500 |
| PRKME | 48466 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 2.38950 | 2.7400 | 3.43650 | 5.4300 | 2.927109 | 0.721949 | 0.659123 | -0.643581 | 2.898181 | 0.597200 | 5.4299 |
| SAHOL | 49095 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 7.96520 | 8.6079 | 9.26820 | 11.6826 | 8.615896 | 0.955310 | 0.056288 | -0.279519 | 8.608193 | 0.764254 | 11.6825 |
| SASA | 47633 | 2012-09-17 06:45:00 | 2019-07-22 15:00:00 | 0.0001 | 0.31920 | 0.7335 | 4.94730 | 8.4260 | 2.294876 | 2.492934 | 0.828215 | -1.016584 | 2.142603 | 2.244777 | 8.4259 |
| SISE | 49090 | 2012-09-17 06:45:00 | 2019-07-22 15:00:00 | 0.0001 | 1.92200 | 2.6682 | 4.14600 | 6.9230 | 3.048367 | 1.422848 | 0.511570 | -0.777349 | 2.995630 | 1.223717 | 6.9229 |
| SKBNK | 47270 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 1.20000 | 1.5100 | 1.72070 | 2.2516 | 1.473651 | 0.294908 | -0.068399 | -1.092395 | 1.475175 | 0.260682 | 2.2515 |
| SODA | 48276 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 1.47580 | 2.6684 | 4.28610 | 7.7659 | 3.189591 | 2.045764 | 0.535910 | -0.656420 | 3.123715 | 1.705455 | 7.7658 |
| TCELL | 49143 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 8.56630 | 9.7001 | 11.23640 | 15.8125 | 9.828003 | 2.356250 | -0.163725 | 0.233760 | 9.865075 | 1.774657 | 15.8124 |
| THYAO | 49282 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 6.43000 | 7.7800 | 12.27000 | 19.9500 | 9.288821 | 4.027293 | 0.932459 | -0.333144 | 9.069076 | 3.279337 | 19.9499 |
| TKFEN | 48930 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 4.31900 | 5.7532 | 14.24675 | 27.3200 | 9.191809 | 6.667047 | 1.168818 | 0.008360 | 8.627737 | 5.586694 | 27.3199 |
| TOASO | 48946 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 10.36560 | 16.5554 | 20.65130 | 29.9218 | 16.597275 | 6.328241 | 0.110278 | -1.010564 | 16.544861 | 5.366081 | 29.9217 |
| TRKCM | 48886 | 2012-09-17 06:45:00 | 2019-07-22 15:00:00 | 0.0001 | 1.17420 | 1.6270 | 2.98260 | 4.6432 | 2.027849 | 1.099667 | 0.553203 | -0.990363 | 1.996023 | 0.961232 | 4.6431 |
| TSKB | 48384 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 0.82540 | 0.9373 | 1.02440 | 1.4208 | 0.945233 | 0.155276 | 0.698243 | 0.122133 | 0.937418 | 0.122650 | 1.4207 |
| TTKOM | 49077 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 5.26730 | 5.7464 | 6.26000 | 7.3500 | 5.660680 | 0.818598 | -0.791310 | 0.453484 | 5.701675 | 0.631715 | 7.3499 |
| TUKAS | 45929 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.6500 | 1.06000 | 1.5300 | 2.13000 | 5.9200 | 1.737529 | 0.867095 | 1.693861 | 3.236394 | 1.644571 | 0.631211 | 5.2700 |
| TUPRS | 49143 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 34.54910 | 49.5542 | 93.42870 | 139.2937 | 62.994535 | 32.398117 | 0.680820 | -0.967032 | 61.312098 | 28.679349 | 139.2936 |
| USAK | 47659 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 0.95710 | 1.0500 | 1.37080 | 2.7578 | 1.220452 | 0.459532 | 1.295345 | 0.827238 | 1.184679 | 0.354688 | 2.7577 |
| VAKBN | 49212 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 4.03220 | 4.4742 | 5.24600 | 7.5814 | 4.735438 | 0.977889 | 0.845849 | -0.073249 | 4.685568 | 0.780719 | 7.5813 |
| VESTL | 48781 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0000 | 4.02000 | 6.3200 | 7.45000 | 14.5400 | 5.942711 | 2.830465 | 0.181898 | -0.368116 | 5.838000 | 2.239724 | 14.5400 |
| YATAS | 46055 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 0.38860 | 0.9658 | 4.23000 | 10.6748 | 2.434249 | 2.552377 | 1.068910 | 0.086681 | 2.210920 | 2.197288 | 10.6747 |
| YKBNK | 49225 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 2.26820 | 2.6093 | 2.87400 | 3.9581 | 2.566327 | 0.422774 | -0.240540 | 0.021049 | 2.574394 | 0.341429 | 3.9580 |
| YUNSA | 45528 | 2012-09-17 06:45:00 | 2019-07-23 15:00:00 | 0.0001 | 3.00670 | 4.1078 | 4.72060 | 9.5275 | 4.079695 | 1.347020 | 0.870869 | 0.764225 | 3.990622 | 1.051336 | 9.5274 |
| ZOREN | 48807 | 2012-09-17 06:45:00 | 2019-07-22 15:00:00 | 0.0001 | 1.03380 | 1.2500 | 1.42650 | 2.4430 | 1.248124 | 0.311330 | 0.408987 | 0.273040 | 1.235133 | 0.239848 | 2.4429 |
Almost all of summary stats provide insights about distributions of given stocks. Mean price, standard deviation of price, skewness, kurtosis, quartiles are helpful to understand better how the stock distributes. Outliers also could be seen with these information. However, rather than value-based analysis, i prefer plot-based analysis which gives overall perspective about stocks. Histograms and box-plots could be used to get insights about stocks.
# Stock-based Histogram and its kernel density estimations
g = sns.displot(
data=data, x="price", col="short_name",
col_wrap=5, kde=True, common_bins=False,
facet_kws={'sharex': False, 'sharey': False},
facecolor="lightgreen", color="black"
)
sns.catplot(
data=data, y="price",
col="short_name", kind='box', col_wrap=5, sharey=False
)
<seaborn.axisgrid.FacetGrid at 0x7fa185344370>
From histogram view, while some of stocks looks like leftly skewed or vice versa, it is hard to say anything about others. But in general, multimodal distribution is common among the stock prices
Moreover, from the boxplot view, Median, quartiles (percentile 25th and 75th), interquartile value and outliers were detected. For example, when i look at 'ZOREN', min value were assumed as (25th percentile - 1.5 * IQR), so smaller values than the min value can be labeled as outlier.
In this direction, while some stocks have outliers, other stocks do not include any outlier value.
# long to wide format transformation
data_pivot = data.pivot_table(
index=["timestamp"], columns="short_name", values="price"
).reset_index().rename_axis(None, axis=1)
data_pivot
| timestamp | AEFES | AKBNK | AKSA | AKSEN | ALARK | ALBRK | ANACM | ARCLK | ASELS | ... | TTKOM | TUKAS | TUPRS | USAK | VAKBN | VESTL | YATAS | YKBNK | YUNSA | ZOREN | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012-09-17 06:45:00 | 22.3978 | 5.2084 | 1.7102 | 3.87 | 1.4683 | 1.1356 | 1.0634 | 6.9909 | 2.9948 | ... | 4.2639 | 0.96 | 29.8072 | 1.0382 | 3.8620 | 1.90 | 0.4172 | 2.5438 | 2.2619 | 0.7789 |
| 1 | 2012-09-17 07:00:00 | 22.3978 | 5.1938 | 1.7066 | 3.86 | 1.4574 | 1.1275 | 1.0634 | 6.9259 | 2.9948 | ... | 4.2521 | 0.96 | 29.7393 | 1.0382 | 3.8529 | 1.90 | 0.4229 | 2.5266 | 2.2462 | 0.7789 |
| 2 | 2012-09-17 07:15:00 | 22.3978 | 5.2084 | 1.7102 | NaN | 1.4610 | 1.1356 | 1.0679 | 6.9909 | 2.9855 | ... | 4.2521 | 0.97 | 29.6716 | 1.0463 | 3.8436 | 1.91 | 0.4229 | 2.5266 | 2.2566 | 0.7789 |
| 3 | 2012-09-17 07:30:00 | 22.3978 | 5.1938 | 1.7102 | 3.86 | 1.4537 | 1.1275 | 1.0679 | 6.9584 | 2.9855 | ... | 4.2521 | 0.97 | 29.7393 | 1.0382 | 3.8529 | 1.91 | 0.4286 | 2.5324 | 2.2619 | 0.7860 |
| 4 | 2012-09-17 07:45:00 | 22.5649 | 5.2084 | 1.7102 | 3.87 | 1.4574 | 1.1356 | 1.0725 | 6.9909 | 2.9760 | ... | 4.2521 | 0.97 | 29.8072 | 1.0382 | 3.8620 | 1.90 | 0.4286 | 2.5324 | 2.2619 | 0.7789 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 50007 | 2019-07-23 14:00:00 | 20.4800 | 7.7300 | 9.1400 | 2.47 | 3.2300 | 1.2100 | 2.8400 | 20.3000 | NaN | ... | 5.6000 | 4.34 | 131.6000 | 1.0500 | 4.8600 | 9.98 | 5.3500 | 2.7500 | 4.2500 | NaN |
| 50008 | 2019-07-23 14:15:00 | 20.5000 | 7.7200 | 9.1400 | 2.47 | 3.2200 | 1.2100 | 2.8400 | 20.3200 | NaN | ... | 5.5700 | 4.35 | 131.5000 | 1.0500 | 4.8600 | 9.98 | 5.3400 | 2.7500 | 4.2400 | NaN |
| 50009 | 2019-07-23 14:30:00 | 20.5000 | 7.7400 | 9.1300 | 2.46 | 3.2300 | 1.2100 | 2.8300 | 20.3400 | NaN | ... | 5.5700 | 4.36 | 131.5000 | 1.0500 | 4.8600 | 9.96 | 5.3400 | 2.7600 | 4.2400 | NaN |
| 50010 | 2019-07-23 14:45:00 | 20.4000 | 7.7000 | 9.1400 | 2.47 | 3.2400 | 1.2100 | 2.8200 | 20.3800 | NaN | ... | 5.5700 | 4.35 | 131.3000 | 1.0400 | 4.8600 | 9.94 | 5.3400 | 2.7700 | 4.2400 | NaN |
| 50011 | 2019-07-23 15:00:00 | 20.4600 | 7.7000 | 9.1400 | 2.47 | 3.2300 | 1.2000 | 2.8300 | 20.3200 | NaN | ... | 5.5600 | 4.34 | 131.8000 | 1.0500 | 4.8500 | 9.93 | 5.3300 | 2.7700 | 4.2400 | NaN |
50012 rows Γ 61 columns
It is known that the dataset also includes NaN observations due to some reasons like public offering dates of stocks. So, stock based sum of nan values were printed below and sorted from higher to lower. ISDMR is the highest nan observations stock. Missing values handling another topic that i have to focus on, but handling missing value is the off-topic for the homework. So, i simply filled these missing values with overall mean of stocks in PCA step by awareing this is completely wrong approach as it could lead to biased comments.
# Number of NaN observations per stocks
data_pivot.isna().sum().sort_values(ascending=False)
ISDMR 37785
ISFIN 7135
ISYAT 6828
ICBCT 5676
PGSUS 4791
...
ISCTR 791
YKBNK 787
THYAO 730
GARAN 704
timestamp 0
Length: 61, dtype: int64
# pair-wise correlation matrix to understanding relationships between stocks
stock_names = [x for x in list(data_pivot) if "timestamp" not in x]
corr_matrix = data_pivot[stock_names].corr()
corr_matrix.head()
| AEFES | AKBNK | AKSA | AKSEN | ALARK | ALBRK | ANACM | ARCLK | ASELS | ASUZU | ... | TTKOM | TUKAS | TUPRS | USAK | VAKBN | VESTL | YATAS | YKBNK | YUNSA | ZOREN | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AEFES | 1.000000 | 0.267502 | -0.000625 | 0.528379 | 0.319894 | 0.496981 | 0.117802 | -0.335744 | 0.032488 | 0.442354 | ... | 0.255756 | -0.204048 | -0.100480 | 0.083471 | 0.397543 | -0.150159 | 0.235401 | 0.538889 | 0.475578 | 0.061791 |
| AKBNK | 0.267502 | 1.000000 | 0.572321 | 0.548420 | 0.657648 | 0.165933 | 0.439169 | 0.644439 | 0.583853 | 0.587298 | ... | 0.578765 | 0.315863 | 0.384826 | 0.785438 | 0.942930 | 0.484747 | 0.593489 | 0.516761 | 0.083837 | 0.627474 |
| AKSA | -0.000625 | 0.572321 | 1.000000 | 0.280772 | 0.636207 | -0.045111 | 0.712541 | 0.735731 | 0.844647 | 0.760313 | ... | 0.131398 | 0.561769 | 0.732025 | 0.728294 | 0.512615 | 0.776674 | 0.821989 | -0.169104 | 0.394182 | 0.701427 |
| AKSEN | 0.528379 | 0.548420 | 0.280772 | 1.000000 | 0.470681 | 0.283348 | 0.469720 | 0.002287 | 0.439348 | 0.594102 | ... | 0.211780 | -0.077261 | 0.235223 | 0.581016 | 0.606355 | 0.136128 | 0.456702 | 0.375938 | 0.451502 | 0.371093 |
| ALARK | 0.319894 | 0.657648 | 0.636207 | 0.470681 | 1.000000 | 0.158375 | 0.736034 | 0.464037 | 0.752573 | 0.719034 | ... | 0.080230 | 0.498251 | 0.655195 | 0.652558 | 0.653074 | 0.509264 | 0.791923 | 0.161212 | 0.374540 | 0.391374 |
5 rows Γ 60 columns
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
fig, ax = plt.subplots(figsize=(15, 12))
cmap = sns.color_palette("vlag", as_cmap=True)
sns.heatmap(corr_matrix, mask=mask, cmap=cmap, vmax=1, vmin=-1, center=0,
square=True, linewidths=.7, cbar_kws={"shrink": .4})
<AxesSubplot:>
By investigating heatmap plot, i chose three pairs of stocks in terms of having positive, negative and no correlation. In each category, randomly selected pairs are just representation of their category. Selection decision was not made by considering their correlation values, was made by just their cell colors
- SISE & TEKFN which have positive correlation between each other
- TCELL & GUBRF which do not have any correlation between each other and,
- ISDMR & BAGFS which have negative correlation between each other.
Task 2 - Moving Window CorrelationΒΆ
Due to interesting pairs of stocks selected from the previous section, i will dive deep with these pairs of stocks in moving window correlation step.
1 Month, 3 Month and 6 Month moving window period were selected to analyse rolling correlation of these pairs of stocks.
PERIOD_SPACE = ["30D", "90D", "180D"]
PAIR1 = ("SISE", "TKFEN")
PAIR2 = ("TCELL", "GUBRF")
PAIR3 = ("ISDMR", "BAGFS")
def calculate_rolling_corrs(
data: pd.DataFrame, stock_pairs: tuple, period_space: list
) -> pd.DataFrame:
"""
returns a dataframe that includes rolling correlation vals over desired window size
"""
stock1, stock2 = stock_pairs
rolling_ops_df = data.set_index("timestamp")
rolling_cors = []
for prd in period_space:
temp_rolling_cor = (rolling_ops_df[stock1]
.rolling(prd)
.corr(rolling_ops_df[stock2])
.reset_index(name="Rolling Corr Values")
)
temp_rolling_cor["Window Size"] = f"{prd} Rolling Corr"
rolling_cors.append(temp_rolling_cor)
rollings_cors_df = pd.concat(rolling_cors, ignore_index=True)
rollings_cors_df = rollings_cors_df.pivot_table(
index="timestamp", columns="Window Size", values="Rolling Corr Values", dropna=False
).reset_index().rename_axis(None, axis=1)
return rollings_cors_df
rolling_corrs_pair1 = calculate_rolling_corrs(
data_pivot, PAIR1, PERIOD_SPACE
)
ax = rolling_corrs_pair1.plot(x="timestamp", figsize=(25, 5))
ax.axhline(0, label="Ref Line", ls="--", alpha=0.69)
ax.set_title(f"{PAIR1} & 1M, 3M and 6M Rolling Correlation Values over Time")
ax.legend()
plt.show()
rolling_corrs_pair2 = calculate_rolling_corrs(
data_pivot, PAIR2, PERIOD_SPACE
)
ax = rolling_corrs_pair2.plot(x="timestamp", figsize=(25, 5))
ax.axhline(0, label="Ref Line", ls="--", alpha=0.69)
ax.set_title(f"{PAIR2} & 1M, 3M and 6M Rolling Correlation Values over Time")
ax.legend()
plt.show()
rolling_corrs_pair3 = calculate_rolling_corrs(
data_pivot, PAIR3, PERIOD_SPACE
)
ax = rolling_corrs_pair3.plot(x="timestamp", figsize=(25, 5))
ax.axhline(0, label="Ref Line", ls="--", alpha=0.69)
ax.set_title(f"{PAIR3} & 1M, 3M and 6M Rolling Correlation Values over Time")
ax.legend()
plt.show()
It's time to evaluate the these 3 pairs of stocks with respect to plot outputs
(SISE, TKFEN)
(TCELL, GUBRF)
(ISDMR, BAGFS)
All of mentioned pairs have common trends in 1M, 3M and 6M rolling correlation lines. All of them are sometimes goes up and sometimes goes down. Certainly, their characteristics differs from each other. In short, seasonal patterns can be seen from the plots. This may originated from market manipulation, news about the company sectors or some bureucratic talks. These patterns will be validated in Google Trends analysis.
Task 3 - PCA ApplicationΒΆ
Dimensionality reduction method was applied via Principal Component Analysis (PCA). Before the application of PCA, standard scaling of features are essential to fair model evaluation. However, imputing nan observations is the first step for this task because PCA does not work with nan observations.
imputer = SimpleImputer(strategy="mean")
imputed_data = imputer.fit_transform(data_pivot[stock_names])
scaler = StandardScaler()
scaled_data = scaler.fit_transform(imputed_data)
scaled_data
array([[ 0.57857848, -1.34926492, -2.02795754, ..., -0.05370799,
-1.41440415, -1.52566733],
[ 0.57857848, -1.3648411 , -2.02930519, ..., -0.09471605,
-1.42662012, -1.52566733],
[ 0.57857848, -1.34926492, -2.02795754, ..., -0.09471605,
-1.41852801, -1.52566733],
...,
[-0.19710197, 1.3516016 , 0.74963037, ..., 0.46175389,
0.12473097, 0. ],
[-0.23797458, 1.30892714, 0.75337385, ..., 0.48559579,
0.12473097, 0. ],
[-0.21345102, 1.30892714, 0.75337385, ..., 0.48559579,
0.12473097, 0. ]])
pca_model = PCA_Function(random_state=3169)
pca_model.fit(scaled_data)
PCA(random_state=3169)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
PCA(random_state=3169)
pca_summary = pd.DataFrame(
np.vstack(
[pca_model.get_feature_names_out(),
pca_model.explained_variance_ratio_,
np.cumsum(pca_model.explained_variance_ratio_)]
)).T
pca_summary.columns = ["PCA Components", "Proportion of Variance", "Cumulative Proportion"]
pca_summary[:20]
| PCA Components | Proportion of Variance | Cumulative Proportion | |
|---|---|---|---|
| 0 | pca0 | 0.488094 | 0.488094 |
| 1 | pca1 | 0.170283 | 0.658377 |
| 2 | pca2 | 0.107268 | 0.765646 |
| 3 | pca3 | 0.049891 | 0.815537 |
| 4 | pca4 | 0.038081 | 0.853618 |
| 5 | pca5 | 0.021406 | 0.875023 |
| 6 | pca6 | 0.015324 | 0.890347 |
| 7 | pca7 | 0.013280 | 0.903627 |
| 8 | pca8 | 0.010235 | 0.913862 |
| 9 | pca9 | 0.008878 | 0.922740 |
| 10 | pca10 | 0.008343 | 0.931083 |
| 11 | pca11 | 0.007107 | 0.938190 |
| 12 | pca12 | 0.006225 | 0.944415 |
| 13 | pca13 | 0.005945 | 0.950360 |
| 14 | pca14 | 0.004796 | 0.955156 |
| 15 | pca15 | 0.004549 | 0.959705 |
| 16 | pca16 | 0.003477 | 0.963182 |
| 17 | pca17 | 0.003088 | 0.966270 |
| 18 | pca18 | 0.002632 | 0.968903 |
| 19 | pca19 | 0.002611 | 0.971514 |
fig, ax = plt.subplots(figsize=(15, 5))
ax.bar(x=pca_summary["PCA Components"], height=pca_summary["Proportion of Variance"])
ax.set_ylabel("Proportion of Variance")
ax.set_xlabel("Components")
ax.set_title("Proportion of Variance per PCA Components")
ax2 = ax.twinx()
ax2.plot(pca_summary["PCA Components"], pca_summary["Cumulative Proportion"], label="Cumulative Variance", color="red", marker="o")
ax2.legend()
fig.autofmt_xdate(rotation=90)
col_names = [f"pca{x}" for x in range(0, 60)]
pca_loadings = pd.DataFrame(pca_model.components_, columns=col_names)
pca_loadings.insert(0, "Stocks", list(data_pivot[stock_names]))
pca_loadings.set_index("Stocks")
pca_loadings.head()
| Stocks | pca0 | pca1 | pca2 | pca3 | pca4 | pca5 | pca6 | pca7 | pca8 | ... | pca50 | pca51 | pca52 | pca53 | pca54 | pca55 | pca56 | pca57 | pca58 | pca59 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AEFES | -0.010619 | -0.115330 | -0.162991 | -0.074086 | -0.139545 | -0.008576 | -0.168831 | -0.124655 | -0.176649 | ... | 0.011291 | -0.115707 | -0.169414 | -0.130355 | -0.097293 | -0.157651 | -0.168918 | 0.037678 | -0.087511 | -0.131610 |
| 1 | AKBNK | -0.164429 | -0.212004 | -0.033005 | -0.159788 | -0.081720 | -0.112583 | 0.058363 | -0.032705 | 0.003778 | ... | -0.268609 | 0.065278 | 0.096402 | -0.142497 | -0.249343 | 0.035116 | -0.017233 | -0.279360 | -0.000860 | -0.097437 |
| 2 | AKSA | -0.261382 | 0.073484 | 0.068234 | -0.147875 | -0.106180 | -0.203374 | -0.094810 | 0.251670 | 0.005959 | ... | 0.110387 | 0.057695 | -0.011123 | 0.079051 | 0.022534 | 0.082714 | -0.071226 | -0.055424 | -0.267015 | 0.069573 |
| 3 | AKSEN | 0.014416 | -0.094138 | 0.076323 | -0.196707 | -0.158188 | 0.248636 | -0.086114 | -0.038360 | -0.090251 | ... | 0.114039 | 0.104382 | -0.033534 | -0.177370 | -0.075411 | 0.135683 | -0.045575 | 0.010490 | 0.188490 | 0.207352 |
| 4 | ALARK | 0.015394 | -0.105984 | 0.066446 | 0.236019 | -0.223156 | -0.122610 | -0.011328 | -0.123666 | 0.063168 | ... | -0.066522 | -0.320755 | -0.067071 | 0.091815 | -0.102075 | -0.027652 | 0.004609 | -0.123473 | 0.161817 | 0.143059 |
5 rows Γ 61 columns
# Factor loadings heatmap
pca_loadings_corr_matrix = pca_loadings.set_index("Stocks")
mask_pca = np.triu(np.ones_like(pca_loadings_corr_matrix, dtype=bool))
fig, ax = plt.subplots(figsize=(15, 12))
cmap = sns.color_palette("vlag", as_cmap=True)
sns.heatmap(pca_loadings_corr_matrix, cmap=cmap, vmax=pca_loadings_corr_matrix.max().max(), vmin=pca_loadings_corr_matrix.min().min(), center=0,
square=True, linewidths=.7, cbar_kws={"shrink": .4})
<AxesSubplot:ylabel='Stocks'>
If I have to discuss briefly of the PCA outputs, Proportion of Variance (explained variance) plot provides an intuition about which components represents data variability. So, First 9 components of PCA represents nearly 90% of dataset. Also, factor loadings heatmap gives information about stock correlation per components. For example, in component 0, AYGAZ, BAGFS, BRISA, CEMAS, ISYAT have higher correlation than other stocks. It also means that these type of stocks contributes more than others in selected PCA component. With the domain knowledge such as sector of these companies or strategic partnerships among these stocks would describe more efficiently in component based correlations. Components could capture these spesific information.
Task 4 - Google Trends AnalysisΒΆ
Google Trends insights were used to validate personal observations from the applied analysis above.
def get_gtrend_rolling_correlation(pair):
"""processed google trends raw data for calculating 6-month rolling correlation"""
stock1, stock2 = pair
filename = "_".join(list(pair))
data = pd.read_csv(DATA_LOCAL_PATH + f"multiTimeline_{filename}.csv").reset_index().iloc[1:, :]
data.columns = ["date", f"{stock1}_Popularity", f"{stock2}_Popularity"]
data["date"] = pd.to_datetime(data.date)
data.set_index("date", inplace=True)
gtrend_rolling_corr = data[f"{stock1}_Popularity"].rolling(6).corr(data[f"{stock2}_Popularity"]).reset_index(name="gtrend_val")
return gtrend_rolling_corr
gtrend_rolling_corr = get_gtrend_rolling_correlation(PAIR1)
rolling_corrs_pair1 = calculate_rolling_corrs(
data_pivot, PAIR1, PERIOD_SPACE
)
ax = rolling_corrs_pair1.plot(x="timestamp", figsize=(25, 5))
ax.plot(gtrend_rolling_corr.date, gtrend_rolling_corr.gtrend_val, color="black", label="GTrend 6-Month Rolling Correlation")
ax.axhline(0, label="Ref Line", ls="--", alpha=0.69)
ax.set_title(f"{PAIR1} & 1M, 3M and 6M Rolling Correlation Values over Time")
ax.legend()
plt.show()
gtrend_rolling_corr = get_gtrend_rolling_correlation(PAIR2)
rolling_corrs_pair2 = calculate_rolling_corrs(
data_pivot, PAIR2, PERIOD_SPACE
)
ax = rolling_corrs_pair2.plot(x="timestamp", figsize=(25, 5))
ax.plot(gtrend_rolling_corr.date, gtrend_rolling_corr.gtrend_val, color="black", label="GTrend 6-Month Rolling Correlation")
ax.axhline(0, label="Ref Line", ls="--", alpha=0.69)
ax.set_title(f"{PAIR2} & 1M, 3M and 6M Rolling Correlation Values over Time")
ax.legend()
plt.show()
gtrend_rolling_corr = get_gtrend_rolling_correlation(PAIR3)
rolling_corrs_pair3 = calculate_rolling_corrs(
data_pivot, PAIR3, PERIOD_SPACE
)
ax = rolling_corrs_pair3.plot(x="timestamp", figsize=(25, 5))
ax.plot(gtrend_rolling_corr.date, gtrend_rolling_corr.gtrend_val, color="black", label="GTrend 6-Month Rolling Correlation")
ax.axhline(0, label="Ref Line", ls="--", alpha=0.69)
ax.set_title(f"{PAIR3} & 1M, 3M and 6M Rolling Correlation Values over Time")
ax.legend()
plt.show()
Although rolling correlation of stock pair trends are generally similar, their values vary most often. The reason why this situation occurs may be "search volume and financial market dynamics completely different from each other."
ConclusionΒΆ
We attempted to thoroughly analyze 60 Borsa Istanbul stocks and their closing prices from nearly 2012 to 2019 for this assignment. We mainly focused on descriptive analysis, moving correlation analysis with selected pairs of stocks, and then dimensionality reduction (PCA). Finally, using the information about relationship between selected pairs of stocks we gathered from Google Trend, we attempted to draw a conclusion.